Background

The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.

Data have been dowloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.

Data Availability

There are three main processed files available for further exploration:

metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds

  • source (chr) - the reporting station and time
  • locale (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • origMETAR (chr) - the original METAR associated with the observation at that source and date-time
  • year (dbl) - the year, extracted from dtime
  • monthint (dbl) - the month, extracted from dtime, as an integer
  • month (fct) - the month, extracted from dtime, as a three-character abbreviation (factor)
  • day (int) - the day of the month, extracted from dtime
  • WindDir (chr) - previaling wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • predomDir (chr) - the predominant wind direction as NE-E-SE-S-SW-W-NW-N-VRB-000-Error
  • Visibility (dbl) - surface visibility in statute miles
  • Altimeter (dbl) - altimeter in inches of mercury
  • TempF (dbl) - the Fahrenheit temperature
  • DewF (dbl) - the Fahrenheit dew point
  • modSLP (dbl) - Sea-Level Pressure (SLP), adjusted to reflect that SLP is recorded as 0-1000 but reflects data that are 950-1050
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • isRain (lgl) - was rain occurring at the moment the METAR was captured?
  • isSnow (lgl) - was snow occurring at the moment the METAR was captured?
  • isThunder (lgl) - was thunder occurring at the moment the METAR was captured?
  • p1Inches (dbl) - how many inches of rain occurred in the past hour?
  • p36Inches (dbl) - how many inches of rain occurred in the past 3/6 hours (3-hour summaries at 3Z-9Z-15Z-21Z and 6-hour summaries at 6Z-12Z-18Z-24Z and NA at any other Z times)?
  • p24Inches (dbl) - how many inches of rain occurred in the past 24 hours (at 12Z, NA at all other times)
  • tempFHi (dbl) - the high temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • tempFLo (dbl) - the low temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • minHeight (dbl) - the minimum cloud height in feet (-100 means ‘no clouds’)
  • minType (fct) - amount of obscuration at the minimum cloud height (VV > OVC > BKN > SCT > FEW > CLR)
  • ceilingHeight (dbl) - the minimum cloud ceiling in feet (-100 means ‘no ceiling’)
  • ceilingType (fct) - amount of obscuration at the minimum ceiling height (VV > OVC > BKN)

metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds

  • source (chr) - the reporting station and time
  • sourceName (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • level (dbl) - cloud level (level 0 is inserted for every source-dtime as a base layer of clear)
  • height (dbl) - level height (height -100 is inserted for every source-dtime as a base layer of clear)
  • type (dbl) - level type (type CLR is inserted for every source-dtime as a base layer of clear)

metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds

  • Contains elements for each of rain/snow/thunder for each of 2015/2016/2017
  • Each element contains a list and a tibble
  • The tibble is precipLength and contains precipitation by month as source-locale-month-hours-events
  • The list is precipList and contains data on each precipitation interval

Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:

# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"


# Sourcing functions
source("./WeatherModelingFunctions_v001.R")


# Descriptive names for key variables
varMapper <- c(source="Original source file name", 
               locale="Descriptive name",
               dtime="Date-Time (UTC)",
               origMETAR="Original METAR",
               year="Year",
               monthint="Month",
               month="Month", 
               day="Day of Month",
               WindDir="Wind Direction (degrees)", 
               WindSpeed="Wind Speed (kts)",
               WindGust="Wind Gust (kts)",
               predomDir="General Prevailing Wind Direction",
               Visibility="Visibility (SM)", 
               Altimeter="Altimeter (inches Hg)",
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               modSLP="Sea-Level Pressure (hPa)", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               isRain="Rain at Observation Time",
               isSnow="Snow at Observation Time",
               isThunder="Thunder at Obsevation Time",
               tempFHi="24-hour High Temperature (F)",
               tempFLo="24-hour Low Temperature (F)",
               minHeight="Minimum Cloud Height (ft)",
               minType="Obscuration Level at Minimum Cloud Height",
               ceilingHeight="Minimum Ceiling Height (ft)",
               ceilingType="Obscuration Level at Minimum Ceiling Height", 
               hr="Hour of Day (Zulu time)"
               )


# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
                    kbos_2016="Boston, MA (2016)", 
                    kdca_2016="Washington, DC (2016)", 
                    kden_2016="Denver, CO (2016)", 
                    kdfw_2016="Dallas, TX (2016)", 
                    kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2014="Las Vegas, NV (2014)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klas_2018="Las Vegas, NV (2018)",
                    klas_2019="Las Vegas, NV (2019)",
                    klax_2016="Los Angeles, CA (2016)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmia_2016="Miami, FL (2016)", 
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2014="New Orleans, LA (2014)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kmsy_2018="New Orleans, LA (2018)",
                    kmsy_2019="New Orleans, LA (2019)",
                    kord_2014="Chicago, IL (2014)",
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    kord_2018="Chicago, IL (2018)",
                    kord_2019="Chicago, IL (2019)",
                    kphl_2016="Philadelphia, PA (2016)", 
                    kphx_2016="Phoenix, AZ (2016)", 
                    ksan_2014="San Diego, CA (2014)",
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ksan_2018="San Diego, CA (2018)",
                    ksan_2019="San Diego, CA (2019)",
                    ksat_2016="San Antonio, TX (2016)", 
                    ksea_2016="Seattle, WA (2016)", 
                    ksfo_2016="San Francisco, CA (2016)", 
                    ksjc_2016="San Jose, CA (2016)",
                    kstl_2016="Saint Louis, MO (2016)", 
                    ktpa_2016="Tampa Bay, FL (2016)", 
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)

The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:

# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
    bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
    mutate(orig_minHeight=minHeight, 
           orig_ceilingHeight=ceilingHeight, 
           minHeight=mapCloudHeight(minHeight), 
           ceilingHeight=mapCloudHeight(ceilingHeight)
           )
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source             <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale             <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime              <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR          <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month              <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir            <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed          <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust           <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir          <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility         <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter          <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF              <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF               <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP             <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1             <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2             <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1            <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi            <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo            <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight          <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType            <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight      <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType        <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight     <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...

Initial Exploration

An initial exploration of the data can use hierarchical clustering based on several numerical features of the data (temperature, dew point, sea-level pressure, wind speed) by month.

Example code includes:

# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>% 
    filter(year==2016) %>%
    select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
    group_by(locale, month) %>% 
    summarize_if(is.numeric, mean, na.rm=TRUE) %>% 
    ungroup() %>% 
    mutate(rowname=paste0(locale, "_", month)) %>% 
    column_to_rownames() %>% 
    select(-locale, -month) %>% 
    as.matrix() %>% 
    scale() %>% 
    dist() %>% 
    as.matrix() %>% 
    as.data.frame() %>%
    rownames_to_column(var="locale1") %>% 
    pivot_longer(-locale1, names_to="locale2", values_to="dist") %>% 
    mutate(city1=str_replace(locale1, "_\\w{3}", ""), 
           city2=str_replace(locale2, "_\\w{3}", ""), 
           month1=str_sub(locale1, -3), 
           month2=str_sub(locale2, -3)) %>%
    filter(month1==month2) %>% 
    group_by(city1, city2) %>% 
    summarize(meandist=mean(dist), sddist=sd(dist)) %>% 
    ungroup() %>% 
    select(-sddist) %>% 
    pivot_wider(city1, names_from="city2", values_from="meandist") %>%
    column_to_rownames(var="city1") %>%
    as.matrix() %>%
    as.dist(diag=FALSE)

distData %>%
    hclust(method="complete") %>%
    plot()

distData %>%
    hclust(method="single") %>%
    plot()

At a glance, there are several sensible findings from the clustering:

  • Two combinations of cities in close geographic proximity (Newark and Philadelphia; Chicago and Milaukee) show the greatest similarity in the clustering
  • A cluster of coastal California cities (San Diego, Los Angeles, San Jose) show strong similarities
  • A cluster of hot-humid cities (Houston-New Orleans and then soon Miami-Tampa) show strong similarities
  • Several clusters of cold-weather cities emerge
  • Two desert cities (Las Vega, Phoenix) show similarities

There are a handful of cities that do not seem to cluster with anything else:

  • Denver, Boston, Seattle, San Francisco

Temperature and Dew Point

Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?

The function is then run for every combination of locales from 2016 in cityNameMapper. A common random seed is applied to every run of the process:

# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

set.seed(2006281443)

n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData, 
                                                        loc1=names_2016[ctr], 
                                                        loc2=names_2016[ctr2], 
                                                        vrbls=c("TempF", "DewF", "month"),
                                                        ntree=25
                                                        )
        n <- n + 1
        if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
    }
}
## Through number: 50 
## Through number: 100 
## Through number: 150 
## Through number: 200 
## Through number: 250 
## Through number: 300 
## Through number: 350 
## Through number: 400

Accuracy can then be assessed:

# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)

# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(-accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1              locale2                 accOverall accLocale1 accLocale2
##    <chr>                <chr>                        <dbl>      <dbl>      <dbl>
##  1 Denver, CO (2016)    Miami, FL (2016)             0.997      0.997      0.997
##  2 Las Vegas, NV (2016) Miami, FL (2016)             0.990      0.991      0.989
##  3 Miami, FL (2016)     Seattle, WA (2016)           0.985      0.985      0.985
##  4 Denver, CO (2016)    Tampa Bay, FL (2016)         0.984      0.986      0.982
##  5 Miami, FL (2016)     Traverse City, MI (201~      0.980      0.981      0.980
##  6 Miami, FL (2016)     Minneapolis, MN (2016)       0.978      0.978      0.978
##  7 Boston, MA (2016)    Miami, FL (2016)             0.976      0.975      0.976
##  8 Denver, CO (2016)    New Orleans, LA (2016)       0.975      0.975      0.975
##  9 Miami, FL (2016)     Phoenix, AZ (2016)           0.974      0.973      0.975
## 10 Green Bay, WI (2016) Miami, FL (2016)             0.971      0.972      0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Newark, NJ (2016)      Philadelphia, PA (20~      0.580      0.591      0.570
##  2 Chicago, IL (2016)     Milwaukee, WI (2016)       0.591      0.600      0.583
##  3 Chicago, IL (2016)     Madison, WI (2016)         0.598      0.622      0.573
##  4 Detroit, MI (2016)     Grand Rapids, MI (20~      0.608      0.568      0.649
##  5 Chicago, IL (2016)     Detroit, MI (2016)         0.610      0.633      0.587
##  6 Grand Rapids, MI (201~ Traverse City, MI (2~      0.613      0.630      0.597
##  7 Madison, WI (2016)     Milwaukee, WI (2016)       0.618      0.625      0.611
##  8 Green Bay, WI (2016)   Madison, WI (2016)         0.619      0.586      0.652
##  9 Detroit, MI (2016)     Madison, WI (2016)         0.619      0.639      0.598
## 10 Chicago, IL (2016)     Grand Rapids, MI (20~      0.621      0.647      0.595
allAccuracy_month <- select(acc_TempF_DewF_month, 
                            locale=locale1, 
                            other=locale2, 
                            accOverall, 
                            accLocale=accLocale1
                            ) %>%
    bind_rows(select(acc_TempF_DewF_month, 
                     locale=locale2, 
                     other=locale1, 
                     accOverall, 
                     accLocale=accLocale2
                     )
              )

# Overall accuracy by location plot
allAccuracy_month %>%
    group_by(locale) %>%
    summarize_if(is.numeric, mean) %>%
    ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_point(size=2) + 
    geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>% 
#     ggplot(aes(x=locale, y=other)) + 
#     geom_tile(aes(fill=accOverall)) + 
#     theme(axis.text.x=element_text(angle=90)) + 
#     scale_fill_continuous("Accuracy", high="darkblue", low="white") + 
#     labs(title="Accuracy Predicting Locale vs. Locale", 
#          caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
#          x="",
#          y=""
#          )

Accuracy is best for cities with significantly different locales. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.

Next, the simple model is run to classify locale across the full 2016 dataset:

# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData, 
                                 vrbls=c("TempF", "DewF", "month"),
                                 locs=names_2016, 
                                 ntree=50, 
                                 seed=2006281508
                                 )

Summaries can then be created for the accuracy in predicting each locale:

evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")

## # A tibble: 898 x 5
##    locale             predicted               correct     n    pct
##    <fct>              <fct>                   <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      484 0.182 
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      49 0.0185
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      39 0.0147
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     179 0.0675
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      34 0.0128
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      42 0.0158
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      48 0.0181
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      42 0.0158
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      49 0.0185
## # ... with 888 more rows

Accuracy is meaningfully increased compared to the null accuracy, though there is still under 50% accuracy in classifying any locale. This is suggestive that the goal will be to define some archetype climates, and to use those for predicting (e.g., humid, cold, desert, etc.).

Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the lake, as well as helping to further pull apart various marine and mid-continental climates.

An updated random forest model is then run:

# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData, 
                                  vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
                                  locs=names_2016, 
                                  ntree=50, 
                                  seed=2006281509
                                  )

The evaluation process is converted to a function:

evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")

## # A tibble: 897 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      643 0.240  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      84 0.0313 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      61 0.0227 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     145 0.0541 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      44 0.0164 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      51 0.0190 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      49 0.0183 
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      26 0.00969
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      82 0.0306 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE     100 0.0373 
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)

Accuracy increases meaningfully, though still well under 50% in most cases.

The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:

# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", 
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=names_2016, 
                                   ntree=50, 
                                   seed=2006281934
                                   )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
                )

## # A tibble: 889 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1096 0.415  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      34 0.0129 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      46 0.0174 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     135 0.0511 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      28 0.0106 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      31 0.0117 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      26 0.00984
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      24 0.00908
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      58 0.0219 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      59 0.0223 
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)

Including wind significantly improves model accuracy for many locales. Even the cold weather cities are now being predicted with around 30%-40% accucacy.

The model can also be built out to consider hour of day and sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:

# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData, 
                                    vrbls=c("TempF", "DewF", 
                                            "month",
                                            "minHeight", "ceilingHeight", 
                                            "WindSpeed", "predomDir",
                                            "modSLP"
                                            ),
                                    locs=names_2016, 
                                    ntree=50, 
                                    mtry=4,
                                    seed=2006282103
                                    )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 880 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1639 0.625  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      16 0.00610
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      23 0.00877
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE      75 0.0286 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE       8 0.00305
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      18 0.00686
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      12 0.00457
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE       6 0.00229
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      45 0.0172 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      51 0.0194 
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)

Adding sea-level pressure again significantly improves prediction ability. The cold weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% accuracy range.

Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:

  • Miami (likely to match with Tampa, New Orleans, and Houston)
  • Las Vegas (likely to match with Phoenix)
  • Los Angeles (likely to match with San Diego and San Jose)
  • Denver (seems potentially isolated)
  • Seattle (seems potentially isolated)
  • Dallas (likely to match to San Antonio)
  • Minneapolis (surrogate for the cold side of cold weather cities)
  • Atlanta (possible to match with DC, St Louis, Indianapolis, but may be too similar to Cold/Miami)

A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):

archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016", 
                 "ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
                 )

archeData <- metarData %>%
    filter(source %in% archeCities) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP"
                                              ),
                                      locs=NULL, # defaults to running for all locales
                                      ntree=100, 
                                      mtry=4,
                                      seed=2006291353
                                      )
## 
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 62 x 5
##    locale             predicted              correct     n    pct
##    <fct>              <fct>                  <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)     TRUE     2204 0.835 
##  2 Atlanta, GA (2016) Dallas, TX (2016)      FALSE     139 0.0527
##  3 Atlanta, GA (2016) Denver, CO (2016)      FALSE      39 0.0148
##  4 Atlanta, GA (2016) Las Vegas, NV (2016)   FALSE      33 0.0125
##  5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE      86 0.0326
##  6 Atlanta, GA (2016) Miami, FL (2016)       FALSE      50 0.0189
##  7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE      45 0.0170
##  8 Atlanta, GA (2016) Seattle, WA (2016)     FALSE      44 0.0167
##  9 Dallas, TX (2016)  Atlanta, GA (2016)     FALSE     203 0.0789
## 10 Dallas, TX (2016)  Dallas, TX (2016)      TRUE     2105 0.818 
## # ... with 52 more rows

The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetypes cities are predicted at 80%+ accuracy.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa, 
                  fullData=metarData, 
                  archeCities=archeCities, 
                  sortDescMatch=TRUE
                  )

There appear to be several issues:

  • Atlanta and Dallas are not particularly distinct and do not make for good archetypes (potentially, Miami, Atlanta, and Dallas can all be replaced by Houston); Denver and Seattle are shaky but may be OK
  • Minneapolis may be too cold, driving the midwestern cities to instead classify partially as Denver
  • The model is potentially learning too much about the specific city rather than about archetypes

The first issue is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype; and replacing Minneapolis with Chicago:

archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016", 
                     "kden_2016", "ksea_2016", "kord_2016"
                     )

archeData_v02 <- metarData %>%
    filter(source %in% archeCities_v02) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291448
                                          )
## 
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 36 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016) Chicago, IL (2016)     TRUE     2321 0.872  
##  2 Chicago, IL (2016) Denver, CO (2016)      FALSE      99 0.0372 
##  3 Chicago, IL (2016) Houston, TX (2016)     FALSE      54 0.0203 
##  4 Chicago, IL (2016) Las Vegas, NV (2016)   FALSE      13 0.00489
##  5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE      70 0.0263 
##  6 Chicago, IL (2016) Seattle, WA (2016)     FALSE     104 0.0391 
##  7 Denver, CO (2016)  Chicago, IL (2016)     FALSE     119 0.0453 
##  8 Denver, CO (2016)  Denver, CO (2016)      TRUE     2282 0.869  
##  9 Denver, CO (2016)  Houston, TX (2016)     FALSE       3 0.00114
## 10 Denver, CO (2016)  Las Vegas, NV (2016)   FALSE     126 0.0480 
## # ... with 26 more rows

The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v02, 
                  fullData=metarData, 
                  archeCities=archeCities_v02, 
                  sortDescMatch = TRUE
                  )

At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold weather cities, and a bit of Houston in some of the warmer cold weather cities. This is suggestive that there are either more then one type of cold-weather cities, or that cold-weather cities tend to be like other archetypes are certain times.

The model is run again, deleting Denver and Seattle, and converting Los Angeles to San Diego:

archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")

archeData_v03 <- metarData %>%
    filter(source %in% archeCities_v03) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291459
                                          )
## 
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 16 x 5
##    locale               predicted            correct     n     pct
##    <fct>                <fct>                <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016)   Chicago, IL (2016)   TRUE     2511 0.946  
##  2 Chicago, IL (2016)   Houston, TX (2016)   FALSE      69 0.0260 
##  3 Chicago, IL (2016)   Las Vegas, NV (2016) FALSE      26 0.00979
##  4 Chicago, IL (2016)   San Diego, CA (2016) FALSE      49 0.0185 
##  5 Houston, TX (2016)   Chicago, IL (2016)   FALSE      44 0.0170 
##  6 Houston, TX (2016)   Houston, TX (2016)   TRUE     2473 0.954  
##  7 Houston, TX (2016)   Las Vegas, NV (2016) FALSE      19 0.00733
##  8 Houston, TX (2016)   San Diego, CA (2016) FALSE      57 0.0220 
##  9 Las Vegas, NV (2016) Chicago, IL (2016)   FALSE      29 0.0113 
## 10 Las Vegas, NV (2016) Houston, TX (2016)   FALSE      13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE     2482 0.971  
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE      33 0.0129 
## 13 San Diego, CA (2016) Chicago, IL (2016)   FALSE      35 0.0130 
## 14 San Diego, CA (2016) Houston, TX (2016)   FALSE      29 0.0108 
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE      63 0.0235 
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE     2559 0.953

These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v03,
                  fullData=metarData,
                  archeCities=archeCities_v03,
                  sortDescMatch=TRUE
                  )

Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.

The Marine archetype may need to be converted to Pacific and to better identify the California entities.

A decision about Seattle is needed, as it is a little like everything and everything is a little like it.

To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling:

  • Marine - San Diego (100%), Los Angeles (100%), San Francisco (50%), San Jose (50%) in an attempt to train the model to recognize that Marine cities differ meaningfully from Cold cities and Humid cities
  • Desert - Las Vegas, Phoenix in an attempt to train the model on differences in Desert and Marine
  • Humid - Houston, Miami, New Orleans, Tampa, reflecting that the model is already largely pulling out the humid cities correctly
  • Cold - Chicago, Milwaukee, Grand Rapids, Green Bay, Traverse City, Madison, Detroit, Minneapolis, Boston to reflect that the model is largely already pulling out cold cities correctly
  • Mixed - St Louis, DC, Dallas, Atlanta in an attempt to reflect that these cities are sometimes classified as Cold and sometimes classified as Humid
  • Excluded for now - Denver (mix of Cold/Desert, see how evolves), Seattle (mix of everything, see how evolves), Philadelphia/Newark/Lincoln/Indianapolis (see how evolves with split of Cold and Mixed), San Antonio (see how evolves with split of Humid, Cold, and Mixed)

Further, a variable will be created for “hr”, the Zulu hour of the observation:

locale2016Mapper <- c('Cold-MI', 'Newark', 'Cold', 'Cold-MI', 'Humid', 'Cold', 'Desert', 'Lincoln', 'Cold', 'Cold', 'Cold', 'Humid', 'Cold', 'Marine', 'Cold-MI')
names(locale2016Mapper) <- c('Detroit, MI (2016)', 'Newark, NJ (2016)', 'Green Bay, WI (2016)', 'Grand Rapids, MI (2016)', 'Houston, TX (2016)', 'Indianapolis, IN (2016)', 'Las Vegas, NV (2016)', 'Lincoln, NE (2016)', 'Milwaukee, WI (2016)', 'Madison, WI (2016)', 'Minneapolis, MN (2016)', 'New Orleans, LA (2016)', 'Chicago, IL (2016)', 'San Diego, CA (2016)', 'Traverse City, MI (2016)')

tibble::tibble(locale=names(locale2016Mapper), mapping=locale2016Mapper) %>%
    arrange(mapping, locale)

modData <- modData %>%
    mutate(hr=lubridate::hour(dtime), 
           locType=locale2016Mapper[locale]
           )

modData %>%
    count(locType, year) %>%
    pivot_wider(locType, names_from="year", values_from="n")

The data are the filtered so that there are an equal number of observations from each locale type. The model can later be predicted against the remaining items:

# Set a seed for reporducibility
set.seed(2006211352)

# Find the smallest locale type
nSmall <- modData %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
subData <- modData %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Sumarize the data subset
subData %>% 
    count(locale) %>% 
    arrange(-n)

The previous model can then be run on the data subset:

# Run random forest for all 2016 locale types
rf_types_2016_TDmcw <- rfMultiLocale(subData, 
                                     vrbls=c("TempF", "DewF", 
                                             "month", 
                                             "minHeight", "ceilingHeight", 
                                             "WindSpeed", "predomDir"
                                             ),
                                     locs=NULL, 
                                     locVar="locType",
                                     pred="locType",
                                     ntree=50, 
                                     seed=2006211401
                                     )

The evaluation process can again be run:

evalPredictions(rf_types_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction", 
                keyVar="locType"
                )

Predictive ability is strongest for the well-differentiated types, and weakest for the more poorly differentiated types. The model is about 50% accurate in classifying cold climates; 25% of the time they are classified as they other type of cold climate, and 25% of the time they are classified as something else (typically Lincoln or Newark).

The modeling is run again using a much smaller training dataset (testSize set to 0.9) to see the implication of a much smaller data volume:

# Run random forest for all 2016 locale types
rf_types_2016_small_TDmcw <- rfMultiLocale(subData, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", 
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir"
                                                   ),
                                           locs=NULL, 
                                           locVar="locType",
                                           pred="locType",
                                           ntree=50, 
                                           seed=2006211419,
                                           testSize=0.9
                                           )

The evaluation process can again be run:

evalPredictions(rf_types_2016_small_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction", 
                keyVar="locType"
                )

As expected, predictions are less accurate with a smaller training dataset. Overall acuuracy falls from ~70% to ~60% with the much smaller training data volumes. This suggests the model is continuing to learn from the data, and may benefit from an expanded sample.

Next, the model is adapted to include the hour (as an integer, which may need to be rethought, though this will generally capture whether it is daytime or nighttime even without factoring) and the sea-level pressure variable:

# Run random forest for all 2016 locale types - add hour and modSLP, limit mtry to 4
rf_types_2016_TDmcwha <- rfMultiLocale(subData, 
                                       vrbls=c("TempF", "DewF", 
                                               "month", "hr",
                                               "minHeight", "ceilingHeight", 
                                               "WindSpeed", "predomDir", 
                                               "modSLP"
                                               ),
                                       locs=NULL, 
                                       locVar="locType",
                                       pred="locType",
                                       ntree=50, 
                                       seed=2006211423, 
                                       mtry=4
                                       )

The evaluation process can again be run:

evalPredictions(rf_types_2016_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="locType"
                )

Accuracy increases to 75%. In particular, the model is better able to distinguish Lincoln and Newark from the remaining locations, which has a follow-on effect of improving cold weather classifications.

Exploration is then run on distinguishing two locales from each other, to see the implications of parameters like data volume (controlled through testSize) and the number of trees.

Comparisons will be run as follows:

  • Desert vs. Humid (should be easiest to pull apart)
  • Cold vs. Cold-MI (should be hardest to pull apart)
ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)

desertHumid <- vector("list", length(ntrees) * length(testSizes))
n <- 1

for (ntree in ntrees) {
    for (testSize in testSizes) {

        desertHumid[[n]] <- rfTwoLocales(subData, 
                                         loc1="Desert", 
                                         loc2="Humid", 
                                         locVar="locType", 
                                         vrbls=c("TempF", "DewF", "month", "hr", 
                                                 "minHeight", "ceilingHeight", 
                                                 "WindSpeed", "predomDir", "modSLP"
                                                 ),
                                         pred="locType", 
                                         seed=2006211432,
                                         ntree=ntree, 
                                         mtry=4,
                                         testSize=testSize
                                         )
        desertHumid[[n]]$ntree <- ntree
        desertHumid[[n]]$testSize <- testSize
        
        n <- n+1
        
    }
}

An assessment of accuracy can then be performed:

df <- sapply(desertHumid, 
             FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
             ) %>%
    t() %>%
    tibble::as_tibble()

df %>%
    mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
    ggplot(aes(x=ntree, y=trainSize)) +
    geom_text(aes(label=paste0(round(100*accuracy), "%"))) + 
    scale_x_log10()

As expected, accuracy is very high for distinguishing Desert and Humid climates, even with small data volumes (training size 10% of the data) and a small number of trees.

The approach is then run for the cold weather cities:

ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)

coldColdMI <- vector("list", length(ntrees) * length(testSizes))
n <- 1

for (ntree in ntrees) {
    for (testSize in testSizes) {

        coldColdMI[[n]] <- rfTwoLocales(subData, 
                                        loc1="Cold", 
                                        loc2="Cold-MI", 
                                        locVar="locType", 
                                        vrbls=c("TempF", "DewF", "month", "hr", 
                                                "minHeight", "ceilingHeight", 
                                                "WindSpeed", "predomDir", "modSLP"
                                                ),
                                        pred="locType", 
                                        seed=2006211501,
                                        ntree=ntree, 
                                        mtry=4,
                                        testSize=testSize
                                        )
        coldColdMI[[n]]$ntree <- ntree
        coldColdMI[[n]]$testSize <- testSize
        
        n <- n+1
        
    }
}

An assessment of accuracy can then be performed:

dfColdColdMI <- sapply(coldColdMI, 
                       FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
                       ) %>%
    t() %>%
    tibble::as_tibble()

dfColdColdMI %>%
    mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
    ggplot(aes(x=ntree, y=trainSize)) +
    geom_text(aes(label=paste0(round(100*accuracy), "%"))) + 
    scale_x_log10()

Greater data volumes and greater tree depths are more helpful for distinguishing cold weather city types from each other. The smallest forest has a 56% accuracy (null 50%) while the largest has a 68% accuracy. This suggests some potential upside to continuing the cold weather modeling process with more cities added for greater data volumes.

Data from years other than 2016 can be fed in to the model, with predictions made to see whether the model is learning general differences in climate or specific features about 2016.

Data that were not included in the modeling can be predicted, with the results assessed:

  • 2016 data that was not chosen for modeling - serves as an additional test frame
  • 2015 and 2017 data - can show the extent to which the model is learning general climate differences as opposed to specific 2016 weather features

Example code includes:

nonUsedData <- modData %>%
    anti_join(select(subData, source, dtime))

nonUsedData %>%
    mutate(localeName=str_replace(locale, pattern=" .{1}\\d{4}.*", replacement="")) %>%
    count(localeName, year) %>%
    pivot_wider(localeName, names_from="year", values_from="n")

The function can then be applied to the 2016 data:

# Modify the file for mapping locale to type
localeMapper <- locale2016Mapper
names(localeMapper) <- str_replace(names(localeMapper), pattern=" .\\d{4}.", replacement="")

# Predictions on 2016 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel, 
                  df=filter(nonUsedData, year==2016), 
                  predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"), 
                  locMapper=localeMapper
                  )

The predictions seem in-line with the test dataset conclusions, as would be expected given that the “unused” 2016 data are randomly sampled and should be no different than that “test” 2016 dataset.

The function can then be applied to the 2015 and 2017 data:

# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel, 
                  df=filter(nonUsedData, year!=2016), 
                  predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"), 
                  locMapper=localeMapper
                  )

Model performance on 2015 and 2017 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the three well-differentiated types (Desert, Humid, Marine).

However, the model struggles to classify Chicago, placing it roughly equally as Cold (correct), Cold-MI, Lincoln, and Newark. This suggests cold-weather cities may be predicted on specific anomalies of 2016 (e.g., where the cold snaps hit). More data may help the model generalize better, specifcally to separate recurring and generalizable climate features of Chicago from weather anomalies that happened to hit Chicago in 2016.

Suppose that models are run on all 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:

# Create the subset for Chicago, Las Vegas, New Orleans, San Diego (should have 2015, 2016, 2017)
sub_2015_2017_data <- modData %>%
    filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")) %>%
    mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""))

# Check that proper locales are included
sub_2015_2017_data %>% 
    count(city, locale)
# Run random forest for 2015-2017 data
rf_types_2015_2017_TDmcwha <- rfMultiLocale(sub_2015_2017_data, 
                                            vrbls=c("TempF", "DewF", 
                                                    "month", "hr",
                                                    "minHeight", "ceilingHeight", 
                                                    "WindSpeed", "predomDir", 
                                                    "modSLP"
                                                    ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=25, 
                                            seed=2006221334, 
                                            mtry=4
                                            )
evalPredictions(rf_types_2015_2017_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="city"
                )

Even with a very small forest (25 trees), the model is almost always separating Las Vegas, Chicago, San Diego, and New Orleans. While the climates are very different in these cities, it is striking that the model has so few misclassifications.

How do other cities map against these classifications?

# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2015_2017_TDmcwha$rfModel, 
                  df=filter(modData, !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))), 
                  predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
                  )
  • Houston is most similar to New Orleans as expected
  • Milwaukee, Green Bay, and Grand Rapids are all classified 90%+ as Chicago
  • Madison, Traverse City, Minneapolis, and Detroit are all classified 85%+ as Chicago
  • Lincoln, Indianapolis, and Newark are all “closest” to Chicago, but all show meaningful prediction volumes to New Orleans (these are the three “least cold” of the cold-weather cities in the modeling)

The prediction process is re-run excluding the altimeter (modSLP) data:

# Run random forest for 2015-2017 data (exclude modSLP)
rf_types_2015_2017_TDmcwh <- rfMultiLocale(sub_2015_2017_data, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir"
                                                   ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=25, 
                                            seed=2006221334, 
                                            mtry=4
                                            )
evalPredictions(rf_types_2015_2017_TDmcwh, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="city"
                )

Performance drops, since altimeters are meaningfully different in high altitude (Las Vegas) and sea-level (New Orleans and San Diego) locations.

Variable importances are plotted:

helperPlotVarImp(rf_types_2015_2017_TDmcwha$rfModel)
helperPlotVarImp(rf_types_2015_2017_TDmcwh$rfModel, titleAdd=" (Excludes SLP)")

Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.

The assessment can be run for the two main 2015-2017 models:

# Run for the full model including SLP
probs_2015_2017_TDmcwha <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwha, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP", 
                              showAcc=TRUE
                              )

# Run for the model excluding SLP
probs_2015_2017_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwh, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind", 
                              showAcc=TRUE
                              )

Predictions with 80%+ of the votes are made ~80% of the time, and these predictions are ~99% accurate. Predictions with <80% of the votes are made ~20% of the times, and these predictions are ~75% accurate. The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction.

A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:

useData <- modData %>%
    filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")))
    
# Run for the model excluding SLP
probs_allcities_2015_2017_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwh, 
                              testData=useData,
                              keyVar="locale", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind", 
                              showHists=TRUE
                              )

Patterns in the predictions stand out:

  • Houston is almost always classified as New Orleans when the model is confident (~65% of the time)
  • The cities that are at least as cold as Chicago are almost always classified as Chicago when the model is confident (~70% of the time)
  • Newark, Lincoln, and Indianapolis are generally not classified with confidence (~50% of predictions are to ‘Too Low’), suggesting these cities may fall in between the four archetypes
  • There are a meaningful number of ‘Too Low’ predictions for every locale, suggesting that many locales “loo like” at least two of these archetypes at certain times of the year

The actual run is cached since it takes a meaningful amount of time:

# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight", 
              "ceilingHeight", "WindSpeed", "predomDir", "modSLP"
              )

# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))

# Run the functions using a for loop over the length of possVars
for (ctr in 1:length(possVars)) {
    
    # Pull in the results of the previous run
    if (ctr==1) {
        prevVars <- c()
    } else {
        prevVars <- rfVarAdds[1:(ctr-1)]
    }
    
    # Run each of them through the combinations
    tmpList <- helperRFCombinations(possVars, df=sub_2015_2017_data, prevVars=prevVars)

    # Assess the performance
    tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
    
    # Prepare to repeat the process
    bestRow <- tmpAccuracy %>%
        filter(locale=="OOB") %>%
        filter(accuracy==max(accuracy))
    bestRow
    
    # Update the rfContainer and rfVarAdds elements
    rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
    rfVarAdds[ctr] <- pull(bestRow, vrblName)
    cat("\nVariable Added:", rfVarAdds[ctr], "\n")

}

The evolution of accuracy can then be plotted:

# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
    mutate(vrblName=rfVarAdds[as.integer(vrblNum)], 
           locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])), 
           plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
           )

# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) + 
    geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) + 
    facet_wrap(~locale, nrow=1) + 
    ylim(c(0, 1)) +
    geom_hline(yintercept=0.25, lty=2) +
    labs(x="", 
         y="", 
         title="Evolution of accuracy as next-best variables are added",
         caption="Null accuracy is 25%"
         ) +
    coord_flip()

As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.

Adding altimeter (SLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bump the accuracy by another ~5%.

This shows some of the power of the random forest algorithm as it is given additional variables to explore.

Further, the OOB estimates provided by the modeling are pessimistic relative to performance on the hold-out sample (which has 94%-96% accuracy rather than the estimated 89%-93% accuracy). This is driven partly by rfTwoLocales() pulling the average error rate rather than the final error rate, an oversight that carries throughout the analysis. The next version of this modeling should fix the approach to classifying errors.

Evolution of error can be plotted to see the impact:

oobError <- c()
for (ctr in 1:9) {
    oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}

tibble::tibble(nVars=rep(1:9, each=25), 
               ntrees=rep(1:25, times=9), 
               accuracy=1-oobError
               ) %>%
    ggplot(aes(x=ntrees, y=accuracy)) + 
    geom_line(aes(group=nVars, color=factor(nVars))) +
    labs(x="Number of Trees", 
         y="OOB Accuracy", 
         title="Accuracy improves with more trees and more variables"
         ) + 
    scale_color_discrete("# Variables")

It will be interesting to see how the four archetypes models perform on other large US cities. A separate routine is written to source the key download and EDA functions and to run them for these key data.